sprint2:NLP baselines (BoW/TF-IDF + LogReg/SVM) #358

Tanvir-ctrl1 · 2025-09-18T19:46:18Z

Description

Sprint-2 adds classical NLP baselines for the Emotion/State notes dataset.

What’s included

Methods

Features: BoW (unigrams), TF-IDF (1–2-grams), TF-IDF (1–3-grams), plus optional lexicon counts
Models: Logistic Regression & Linear SVM (class_weight="balanced")
Leakage guard: a custom preprocessor plus stopword list removes label tokens (normal/sick/uncomfortable + variants); a tripwire fails the run if any of them leak into the vocabulary
Splits: group-aware hold-out (GroupShuffleSplit) + 5-fold GroupKFold CV; duplicates are dropped after scrubbing
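
A minimal sketch of how the leakage guard and its tripwire could look, for reviewers who want the idea without opening the pipeline. It is not the code in this PR: `scrub_labels`, `assert_no_label_leak`, and the suffix-matching regex are hypothetical, and since the "variants" are not enumerated here, the sketch simply assumes they are the label stems followed by extra word characters.

```python
import re

# Label stems named in this PR. How "variants" are defined is NOT stated,
# so matching any trailing word characters is an assumption of this sketch.
LABEL_STEMS = ("normal", "sick", "uncomfortable")
LABEL_PATTERN = re.compile(
    r"\b(?:" + "|".join(LABEL_STEMS) + r")\w*\b", flags=re.IGNORECASE
)


def scrub_labels(text: str) -> str:
    """Lowercase the note and blank out label tokens and suffixed variants."""
    return LABEL_PATTERN.sub(" ", text.lower())


def assert_no_label_leak(vocabulary) -> None:
    """Tripwire: raise if any vocabulary term (or n-gram) contains a label token."""
    leaked = sorted(term for term in vocabulary if LABEL_PATTERN.search(term))
    if leaked:
        raise ValueError(f"label tokens leaked into vocabulary: {leaked}")


# Toy notes, made up for illustration.
notes = ["Patient feels SICK and tired", "Normal appetite, slightly uncomfortable"]
scrubbed = [scrub_labels(note) for note in notes]

# Stand-in for a fitted vectorizer's vocabulary.
vocab = {token for doc in scrubbed for token in re.findall(r"[a-z]+", doc)}
assert_no_label_leak(vocab)  # passes silently when nothing leaked
```

A function like `scrub_labels` can be passed as the `preprocessor` argument of scikit-learn's `CountVectorizer` or `TfidfVectorizer`, and the tripwire can then be run on the fitted vectorizer's `vocabulary_`. Note that the broad suffix match also removes unrelated words that start with a stem, which trades some vocabulary for safety.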

Headline results (treat CV as the primary metric)

Best CV: TF-IDF(1–2) + LinearSVM ≈ 0.98 accuracy, 0.956 ± 0.032 macro-F1
Single-split results are shown in confusion_matrix.png (informational only)
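
For context on how a "mean ± std macro-F1" figure is typically produced, here is a hedged sketch of 5-fold GroupKFold CV over a TF-IDF(1–2) + LinearSVC pipeline. The data is a made-up toy set, the group IDs are hypothetical (this PR does not say what the grouping key is), and the numbers it prints are unrelated to the results above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data: one note per class for each of 5 hypothetical groups.
class_notes = {
    "normal": "ate well and slept through the night",
    "sick": "fever and coughing since morning",
    "uncomfortable": "restless with mild pain after lunch",
}
texts, labels, groups = [], [], []
for group_id in range(5):
    for label, note in class_notes.items():
        texts.append(f"{note} visit{group_id}")
        labels.append(label)
        groups.append(group_id)

# Vectorizer lives inside the pipeline so it is refit on each training fold.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(class_weight="balanced"),
)

# GroupKFold keeps every group entirely in train or entirely in test.
scores = cross_val_score(
    pipeline,
    texts,
    labels,
    groups=groups,
    cv=GroupKFold(n_splits=5),
    scoring="f1_macro",
)
print(f"macro-F1: {scores.mean():.3f} ± {scores.std():.3f}")
```

Keeping the vectorizer inside the pipeline means the vocabulary and IDF weights are learned from training folds only, which is the same concern the group-aware split addresses at the sample level.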

Tested and working locally (sprint2nlp conda env)
Code follows project style (black-ish formatting, no notebooks committed for pipeline)
Self-reviewed; checked feature preview for label leakage
Documentation: sprint2_report.md + settings.json
Request review from 2 devs (ML + app)

Ensure the conda env has scikit-learn (imported as sklearn), pandas, numpy, scipy, and matplotlib installed.
```
conda activate sprint2nlp
```

sprint2:NLP baselines (BoW/TF-IDF + LogReg/SVM)

4bd2ac7